Evolving Strategies for Focused Web Crawling

Authors

Judy Johnson

Kostas Tsioutsiouliklis

C. Lee Giles

Abstract

The rapid growth of the World Wide Web has created many challenges for general-purpose crawlers, search engines, and web directories, making it difficult to find, index, and classify web pages by topic. Topic-driven crawlers can complement search engines because they pre-classify the pages retrieved by the crawl. To implement such a focused crawler, a strategy for ordering the crawl frontier is required. Such a strategy can use only information gleaned from previously crawled pages to estimate the relevance of a newly observed URL. Because the best strategy for ranking URLs in the crawl frontier is not immediately apparent, we discover strategies by evolving them using a genetic algorithm. Strategies are learned by evaluating the results of crawls simulated using a database generated by a previous, more general crawl. We conclude that a rank function that combines analysis of text and link structure yields effective strategies. The evolved strategies perform better than the commonly used Best First strategy.
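
The approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy `PAGES` table, the two features (best parent text score and number of crawled in-links), the weight pair `(w_text, w_link)`, and the selection, crossover, and mutation operators are all assumptions chosen for brevity. The sketch ranks frontier URLs using only information from already crawled pages, scores each candidate strategy by the harvest rate of a simulated crawl, and evolves the weights with a simple elitist genetic algorithm. The Best First baseline is modeled here as the text-only weights `(1.0, 0.0)`.

```python
import random

# Hypothetical toy stand-in for a database produced by a prior, more general crawl.
# page -> (is relevant to the topic, text similarity to the topic in [0, 1], outlinks)
PAGES = {
    "seed": (True, 0.8, ["a", "b", "hub"]),
    "a": (True, 0.9, ["c", "d"]),
    "b": (False, 0.2, ["e", "f"]),
    "hub": (False, 0.3, ["c", "g", "h"]),
    "c": (True, 0.7, ["g"]),
    "d": (False, 0.1, []),
    "e": (False, 0.1, []),
    "f": (False, 0.2, []),
    "g": (True, 0.8, ["h"]),
    "h": (True, 0.9, []),
}

BEST_FIRST = (1.0, 0.0)  # assumed baseline: rank by parent text score only


def simulate_crawl(weights, budget=6, seed_url="seed"):
    """Crawl up to `budget` pages and return the fraction that are relevant."""
    w_text, w_link = weights
    # url -> text scores of the crawled pages that link to it
    frontier = {seed_url: [1.0]}
    crawled = []

    def rank(url):
        parents = frontier[url]
        # Text evidence plus link evidence, both taken from crawled pages only.
        return w_text * max(parents) + w_link * len(parents)

    while frontier and len(crawled) < budget:
        # Sorting first makes tie-breaking deterministic (alphabetical order).
        url = max(sorted(frontier), key=rank)
        del frontier[url]
        _, text_score, outlinks = PAGES[url]
        crawled.append(url)
        for link in outlinks:
            if link not in crawled:
                frontier.setdefault(link, []).append(text_score)
    return sum(PAGES[u][0] for u in crawled) / len(crawled)


def evolve(generations=20, pop_size=8, rng_seed=0):
    """Evolve (w_text, w_link) pairs; fitness is the simulated harvest rate."""
    rng = random.Random(rng_seed)
    # Seeding the population with the baseline is a design choice of this sketch.
    population = [BEST_FIRST] + [
        (rng.random(), rng.random()) for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=simulate_crawl, reverse=True)
        elite = ranked[: pop_size // 2]  # elitism: the best half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            # Averaging crossover followed by Gaussian mutation, clamped at zero.
            child = tuple(
                max(0.0, (x + y) / 2 + rng.gauss(0, 0.1)) for x, y in zip(p1, p2)
            )
            children.append(child)
        population = elite + children
    return max(population, key=simulate_crawl)


if __name__ == "__main__":
    best = evolve()
    print("evolved weights:", best)
    print("evolved harvest rate:", simulate_crawl(best))
    print("Best First harvest rate:", simulate_crawl(BEST_FIRST))
```

Because the baseline is in the initial population and the best half always survives, the evolved strategy in this sketch can never score below the baseline on the toy data. That property comes from the sketch's construction and says nothing about the results reported in the paper.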

Similar resources

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. An unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...

Focused Web Crawling: A Generic Framework for Specifying the User Interest and for Adaptive Crawling Strategies

Compared to the standard web search engines, focused crawlers yield good recall as well as good precision by restricting themselves to a limited domain. In this paper, we do not introduce another focused crawler, but we introduce a generic framework for focused crawling consisting of two major components: (1) Specification of the user interest and measuring the resulting relevance of a given we...

A Study of Focused Web Crawlers for Semantic Web

Finding useful information on the web, which has a large and distributed structure, requires efficient search strategies. Focused crawlers selectively retrieve Web documents that are relevant to a predefined set of topics. To make intelligent decisions about relevant URLs and web pages, different authors have proposed different strategies. In this paper we review and compare focused crawling s...

Accurate and Efficient Crawling for Relevant Websites

Focused web crawlers have recently emerged as an alternative to the well-established web search engines. While the well-known focused crawlers retrieve relevant webpages, there are various applications which target whole websites instead of single webpages. For example, companies are represented by websites, not by individual webpages. To answer queries targeted at websites, web directories are...

Ontology-Focused Crawling of Web Documents and RDF-based Metadata

The enormous growth of the World Wide Web in recent years has made it important to develop document discovery mechanisms based on intelligent and focused crawling techniques. The next-generation Web, the Semantic Web, that is currently being developed as a meta Web, building on the existing one, changes the classical crawling task. Metadata that is based on ontologies will exist in the form of ...

Publication date: 2003